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Abstract  -  An  algorithm  of  Charles  S.  Peskin  for  simulation  of 
incompressible  flow  of  a  fluid  coupled  to  an  elastic  membrane,  was 
implemented  as  a  program  for  an  asynchronous  shared  memory  parallel 
computer,  the  "NYU-Ultracomputer".  The  algorithm  is  an  iterative 
repetition  of  a  cycle  consisting  of  recomputing  the  position  of  the 
membrane  and  the  velocity  of  the  fluid  for  successive  descrete  times. 
Each  iteration  requires  solving  Navier-Stokes  equations  for  the  fluid, 
using  an  FFT  and  algorithms  for  solving  linear  equations  and  also 
includes  an  iterative  minimization  of  a  tension  functional  for 
recomputation  of  the  new  position  of  the  membrane.  All  these  phases 
are  parallelized  by  spawning  a  number  of  independent  tasks  and  waiting 
for  all  of  them  to  terminate.  The  program  includes  subroutines 
REQUEST (INDEX)  and  SYNCH  which  provide  coordination  necessary  for  such 
a  spawning.  Timing  analysis  of  simulated  executions  predicts  high 
efficiency  of  this  algorithm  for  parallel  computers  that  include  up  to 
hundreds  of  processing  elements  in  the  2-D  case  and  up  to  thousands  of 
processing  elements  in  the  3-D  case. 
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Program  of  the  U.S.  Department  of  Energy  under  Contract  No. 
DE-AC02-76ER03077,  and  in  part  by  the  National  Science  Foundation  under 
Grant  No.   NSF-MCS79-21258. 
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1.  Introduction 


Charles  P.  Peskin  has  developed  [Pes]  a  serial  algorithm  for 
simulation  of  incompressible  flow  of  a  fluid  coupled  to  an  elastic 
membrane  and  has  suggested  ways  of  parallelizing  this  algorithm 
[UCN18,  UCN19,  UCN20].  Note  that  serial  simulation  of  one  pulsation  in 
the  2-D  model  of  blood  flow  through  heart  valves  consumes  about  one 
hour  of  CDC-6600  CPU  time.  Essential  speed-up  of  these  computations  is 
vitally  important   for  further  study  and  for  the  extension  to  the  3-D 


case. 


The  present  note  describes  a  simplified  2-D  version  of  this 
algorithm  implemented  as  a  parallel  FORTRAN  program  for  the 
"NYU-Ultracomputer"  [UCN40],  an  asynchronous  shared  memory  parallel 
machine.  Note  that  these  simplifications  mostly  affected  the 
initialization  part  of  the  program  concerned  with  the  representation  of 
the  membrane.  In  our  simplified  version  the  membrane  was  a  stretched 
ellipse,  whereas  in  the  realistic  version  it  represents  the  heart.  The 
most  time  consuming  "physics"  of  the  simulation  was  practically 
untouched.  Parallel  execution  of  this  program  was  obtained  using 
simulators  [UCN12,  UCN28]. 

The  algorithm  is  the  iterative  repetition  of  a  cycle  consisting  of 
recomputing  position  of  the  membrane  and  fluid  velocity  for  successive 
descrete  times.  This  cycle  includes  several  phases:  solving 
Navier-Stok.es  equations  for  the  fluid  using  an  FFT ,  solving  linear 
equations,   and   performing   an  iterative   Newton's   minimization  of  a 
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tension  functional  for  recomputation  of  the  new  position  of  the 
membrane.  The  system  of  linear  equations  arising  in  this  minimization 
is  solved  using  odd-even  reduction  method.  Each  phase  was  parallelized 
by  spawning  a  number  of  independent  tasks  and  waiting  for  these  tasks 
to  terminate.  The  program  includes  subroutines  REQUEST (INDEX)  and 
SYNCH  which  coordinate  the  spawning  and  termination. 

This  note  describes  the  structure  of  the  parallel  algorithm, 
presents  a  timing  analysis  of  simulated  executions  in  the  2-D  case,  and 
predicts  efficiency  of  the  algorithm  in  the  3-D  case. 


2.  Serial  algorithm. 

The  2-D  problem  is  described  as  follows.  (N^  +  I)  ^  (N^  +  1)  mesh 
nodes  are  placed  uniformly  onto  a  square  of  size  t  >^  t  •  In  this  square 
the  field  of  flow  velocities  and  the  membrane  positions  are  computed. 
Periodic  boundary  conditions  are  assumed,  i.e.  the  square  is  thought 
of  as  representing  a  torus. 

At  the  beginning  of  time  step  n  the  following  data  are  available: 
(i)  an   array   of   size   N.    N^   of   flow  velocity  vectors  ^±2' 
i»j  =  0»1 » •  •  •  jNj^-l ;  i.e.   a  vector  for  each  mesh  node; 

(ii)  an  array  of  size  N„  of  vectors  of  membrane  positions  X^, 
k  =  0,...,N2-1;  i.e.  a  two-dimensional  position  for  each  membrane 
point. 
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Membrane  positions  are  thought  of  as  material  points  fixed  on   the 
elastic   boundary,   so   that  an  increase  of  the  distance  between  two 
consecutive  membrane  points  X^   and  X^'   (k'  =  (k+l)modN2)   beyond   some 
threshold  As  causes  elastic  forces  to  increase. 

Peskin's   algorithm  computes  uj"*"^  and  Xg'^l  through  the  following 
sequence  of  phases: 

Phase  1.  Interpolate  fluid  velocity  field  from  the  mesh  nodes  onto   the 
membrane  positions: 

(2.1)  Un  =  j;  u^.  D.  j(X^)  (Ax)2 

i.j 
where: 

Ax  =  £/N^  is  the  mesh  width; 

UP  is  velocity  vector  of  fluid  flow  at  membrane  positions  #k, 

k  =  0,...,N2-1; 

Dji^.(X),  X  =  (x,y),  is  the  two-dimensional  weight  function  on 
the  mesh, 

(2.2)  D.^(x,y)  =  6^^(x-iAx)  x  &i^yiy-2^y) , 

where  one-dimensional  weight-function  6  is  given  by  expression 
6^(x)  =  __  (1  +  cos  _— )  for  lx|  _<  2h  and  <5^(x)  =  0  otherwise 
(coordinates  x  and  y  are  understood  modulo  £). 


Phase  2.  Find  set  of  auxiliary  membrane  positions  Xj^  that  minimizes  the 
following  tension  function 


(2.3)  $(Xi,...,Xn  )  =  i  I  IX^  +  At  Ug  -  X^l^   +  X  E(Xi,...,Xn  ), 
^      k  ^ 
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where: 


At,  the  length  of  a  time  step,  and  X   are  positive  parameters; 
function  E(X,  ,..,,Xj^  )   is  the  elastic  energy  in  the  membrane, 
defined  as 

(2.4)  ^   ^T     ^   ^k^  ^s» 

k 
where  As  is  the  distance  between  any   two   succesive  membrane 

points   in  a   relaxed  membrane  and  C  are  positive  parameters*, 

and 

l^k'  "  ^k'  "  ^^ 

^k  =  °^^  ( xi •   °^ 

if  k  and  k'  are  two  succesive  points  on  the  membrane. 


The  Newton's  minimization  of  $(Z),  Z  =  (X.  ,.,,Xfj  ),   is   effected 
as  follow: 

i)  Compute  initial  approximation  Z^  as  X^  +  At  UP; 

ii)  For  each  successive  iteration  Z^,  r  =  1,2,...,  the  function  *(Z)  is 
approximated  by  a  quadratic  function  $  (Z)  whose  value,  first  and 
second  derivatives  are  the  same  as  of  $(Z)  at  Z  =  Z^ ; 

iii)  The  minimum  of  $  (Z)  is  found  by  solving  a  system  of  linear 
equations  with  a  pseudo-tridiagonal  matrix  of  order  2N„  x  2N2.  (This 
matrix  consists  of:  a  main  non-zero  diagonal  of  2  x  2  blocks;  the  two 
diagonals  of  such  blocks  adjacent  to  the  main  diagonal;  the  two  2x2 
blocks  in  top-right  and  bottom-left  corners).  Each  such  a  system  is 
solved  using  odd-even  reduction  method,  for  which  purpose  N2  must  be  a 
power  of  2  (see  [UCN19]). 


*At first  glance  it  seems  that  C  might  be  incorporated  in  X  in  (2.3) 
and  replaced  by  1  in  (2.4).  This  however  is  not  convenient  for  the 
following  formula  (2.5), 
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Then  evaluate  set  of  membrane  forces 


(2.5)  f*  =  -^(X*.....X5^) 


Note  that  computations  of  the  coefficients  of  the  linear  system  to  be 
solved  at  each  minimization  phase  are  arranged  so  that  the  array  f,  is 
obtained  as  a  side-effect  of  the  computations  of  the  first  derivatives 
of  *(Z). 


Phase  3.  Compute  auxiliary  field  u  of  flow  velocities! 


(2.6)  u*  .  =   u?.  +  At  I   f*  Dij(X2)  As 

k 


Phase  4.  Solve  the   following   systems   of   linear  equations   for  the 
unknown  auxiliary  vector  fields  u   and  u    : 


(2.7x)  [I  +  At  (u^  dJx>  -  V  d|x)d(x))]  u 


**  _  * 

=  u 


(2.7y)  [I  +  At  (u"  ojy)  -  V  B^y^Divh]    u***  =  u** 


where: 


I  is  the  identity  operator; 

the  D's  are  finite  difference  operators  defined  as  follows:  if 
F(x,y)  is  a  scalar  field  and  (H^(x,y) ,Hy(x,y) )  is  a  vector 
field  on  the  square  whose  values  on  the  mesh  points  are 
F^  J.  =  F(iAx,jAy)  and  H^j^  ^  =  H^(iAx,jAy),  respectively,  and  if 
all  indexes  are  understood  modulo  N.  then 
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V    .  .     .    -  V  .     , 


^i+l,j  "  ^i-l.j  ^  3F 


^   "   ^'^  Ax2  3x2 


Similar  expressions   hold   for  y;  here  and  below  the  symbol  ~ 
means  "in  the  continuous  case  corresponds  to". 

Then  solve  the  following  system  of  linear  equations  with  respect   to 
Nj^  X  Nj^  values  of  an  auxiliary  (pressure)  field  p"   : 


(2.8)  D  G  p^+l  =  J_  D  u*** 

At 


where: 

D  H.  .  =  D^x^H  +  D^y)H      ~  DIV  H 
G  F^.    =  (dJx)f,  dJx)f)      ~  GRAD  F 

°^  ^ij  =  ^i+2.j  +  Fi.2,j  +  Fi,j_2  +  F,^j+2  -  ^  ^ii 

System  (2.8)  is  to  be  solved  in  three  phases: 

Phase  4.1.  Carry  out  an  FFT  with  parameter  oo  =  exp( )  on 

the  N.  X  N,  matrix  Du  (the  FFT  is  first  applied  to  rows, 
then  to  columns). 

Phase  4.2.  Solve  (2.8)  with  respect  to  the  FFT  images  (which  is 
a  scalar  division  of  each  of  the  N.  x  N,  results  of  phase  4.1 
at  a  polynomial  of  uj  for  each  element  of  the  FFT-transf orm  of 


Phase  4.3.  Apply  invert  FFT  of  the  results  in  the  same  way  as 
in  phase  4.1. 

Then  evaluate  the  velocity  field  u^  .  for  the  time  step  (n+1): 

(2.9)  u^+l  =  u***  -  At  G  pn+1 

Phase  5.  Compute  the  membrane  positions  at  the  time  step  /'(n+1): 

(2.10)  Xn+1  =  X^  +  At  r  u^-tl  D.  .(X^)  (Ax)^ 

i.j 

Note  that  only  a  fixed  (independent  of  N^)  number  of  terms  in  the  above 
sum  is  non-zero. 

This  completes  the  calculations  required  during  time  step  n. 


Figure  2.1  represents  a  sequence  of  5  successive  positions   of  an 

^k+ 1  ~  ^k 

initially  uniformly  stretched  ( =  const  >  1)  elliptic  membrane 

As 

for  the  first  half-period  of  its  oscillation  in  the  fluid.  Positions 
of  the  membrane  are  recorded  every  four  time  steps.  Here  N,  =  16, 
N2  =  64.   (The  mesh  and  the  fluid  flow  velocity  fields  are  not  shown.) 
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« 

3.  The  structure  of  the  serial  algorithm. 

The  computations   of   the   serial   algorithm   were   organized 
essentially  as  a  sequence  of  simple  loops  of  the  form 


DO  LA  J  =  1  TO  MULTIPLICITY 
(3.1)     CALL  TASK  (J) 
LA  CONTINUE 


where  LA  is  a  FORTRAN  label  corresponding  to  a  given  loop;  TASK  is  one 
of  the  subroutines,  listed  below. 

When  grouping  computations  into  tasks  the  following  restriction 
was  imposed.  Iterations  of  a  given  instance  of  loop  (3.1)  must  work  in 
separate  sections  of  memory.  To  reflect  the  special  high-performance 
hardware  being  developed  at  NYU  [UCN40]  we  permit  one  exception  to  this 
rule:  several  iterations  of  the  same  loop  may  increament  the  same 
variable  providing  no  other  use  is  made  of  this  variable  in  the  same 
loop.   We  call  such  kind  of  task  dependence  additive  interleaving  (AI). 

We  note  that  satisfying  the  restriction  that  tasks  must  be 
independent  or  at  most  be  AI  dependent  for  the  considered  algorithm  did 
not  present  any  difficulty  in  programming^. 


«The  case  when  tasks  of  the  same  loop  may  interleave  addition  to  some 
memory  locations  as  well  as  use  this  location  for  other  purposes  is 
generally  possible  (and  creates  many  interesting  theoretical  issues, 
see  for  example  L.  Umport.  The  parallel  execution  of  DO  Loops' 
Comm.  ACM,  1974,  Vol.17,  No. 2,  pp. 83-93).  The  program  in  question, 
however,  did  not  provide  an  opportunity  to  expand  on  this. 
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Now  we  give  the  details  of  the  two  cases  in  the  program  where 
organizing  of  loop  (3.1)  with  AI  was  ' prefered  over  complete 
independance  of  tasks. 

Case  1.  Task  USTAR  effects  computation  of  formula  (2.6)  corresponding 
to  one  value  of  k  in  and  updates  all  u^j  for  which  Dj^j(X^)  *  0. 
(There  are  values  of  16  u .  .  subject  to  updating  by  each  USTAR(k).) 

Case  2.  The  pseudo-tridiagonal   system  used  to  minimize  $  possesses 

two  symmetries:  sjnnmetry  I  with  respect  to  the  main  diagonal,  which  may 

be   expressed  in  the   form  A^  .  =  A^j^   for  2x2  blocks  Aj^j;   and 

symmetry  II,    which    may    be    expressed     in     the     form 

A.  .  ,  +  A.  .  +  A.  ^,,  =  1(2),   where   1(2)   is  the  2x2  identity  matrix 
1,1— i    i>i    1  >  J''"J- 

and  all  indexes  are  understood  modulo  N2.  Task  COMPS  computes 
coefficients   of   this   system  as   follows:   COIlPS(i)  computes  A    ,; 

X  ,  1   1 

effects  the  assignment  A.  ,  .  <-  A.  ^_i,  corresponding  to  symmetry  I; 
and  effects  the  two  AI  updatings  A  <-  A..  -  A.  ._•,  +  I(2)/2  and 
A._j^  ._^  <-  Aj^_^  ^_i   -   Aj__i  i  +  I(2)/2,  corresponding  to  symmetry  II. 

Another  criterion  considered  when  restructuring  the  serial 
algorithm  was  the  granularity  of  a  task.  There  are  two  opposing 
tendencies  each  improving  one  property  of  parallelization  at  the 
expence  of  the  other: 

(a)  for  coarser  granularities  the  relative  overhead  of  parallelization 
becomes  smaller  which  makes  computations  more  efficient; 

(b)  for  finer  granularities   the  degree  of   parallelization  becomes 
larger  which  makes  computations  faster. 
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Since  the  expected  overhead  of  spawning  is  at  most  few  hundreds 
cycles  per  task  (see  [UCN13]),  in  setting  the  granularity  we  tried  to 
make  the  task  complexity  at  least  as  large  as  hundreds  of  machine 
cycles.  Such  granularity  allows  the  effective  use  of  up  to  hundreds  of 
processing  elements  (PEs)  in  2-D  case  and  up  to  thousands  of  PEs  in  3-D 
case  (see  the  analysis  in  section  6).  Thus  a  3D-problem  would 
effectively  utilize  a  full  Ultracomputer. 

In  few  cases  the  body  of  the  task  was  placed  directly  inside  the 
loop  (3.1)  without  separating  it  as  a  subroutine.  These  cases  are 
referenced  to  in  the  table  3.1  as  "without  names".  These  tasks  are 
small  (up  to  100  cycles)  and  account  for  a  small  fraction  of  the 
computational  complexity  and  the  number  of  spawnings.  Parallelizing 
them  in  the  general  scheme  of  (3.1)  was  prefered  over  an  artificial 
increase  of  their  granularity,  since  the  latter  would  make  the  code 
very  complicated  and  difficult  to  read. 

Several  subroutines  are  invoked  more  than  once  at  each  time  step. 
For  example  an  FFT  is  invoked  two  times  (for  rows  and  then  for  columns) 

JL  JL  JL 

to  obtain  the  transform  of  Du  at  phase  4.1  and  then  two  more  times 
to  invert  the  FFT  at  phase  4.3.  Moreover  the  multiplicities  of  these 
invocations  are  not  the  same:  the  usual  multiplicity  is  N,  but  one 
invocation  requires  multiplicity  N^/2  +  1.  (For  the  Fourier-image  f ^ ^ 
one  has  f„  .  =  f-  ^,  where  f  is  complex  conjugate  to  f .  Instead  of 
computing  f ^  .  as  the  Fourier-transform  for  i  >  ___  it  is  more  economical 
to  compute  the  conjugation  of  f^  .j^  ^  -i^*  '^^^  total  numbers  of 
invocations   (in   this   example   3   and  1)  are  also  shown  at  table  3.1. 
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Note  that  the  specification  of  a  call  to  a  subroutine  in   such  a   case 
can  be  effected  by  parameters  additional  to  J  in  (3.1). 

In  two  cases  (subroutines  COMPS  and  REDUCT)  the  number  of 
invocations  is  not  known  in  advance  since  it  depends  on  the  number  of 
iterations  of  the  minimization  algorithm.  We  will  call  the  latter 
#iter. 

In  table  3.1  and  below  Ig  is  the  binary  logarithm. 
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1                        1  order  of   1               I  number  of 
name    |       function          | complexity]   MULTIPLICITY  (invocations 

1                      1          1               Iper  time  step 

KERN    1  compute  weight  function]      1     |       N2       |      1 
1  for  a  point  of  membrane]          I              1 

INTRVEL  1  interpolate  velocity   1      1     ]       N2      |     1 
]onto  a  point  of  membrane]           ]               1 
]   at   phase  1           j          1               1 

OSTAR    ]    effect   phase  3       ]      1     ]       N2       ]      1 

GE      1  Gausian  elimination  of  I     ^1    1       ^1      1     2 
1  a  triadiagonal  matrix   ]          I               ] 
1  of  solution  of  (2.7x)   ]          |              I 
1  per  row  of  mesh  and     ]          |              I 
1  of  solutions  of  (2.7y)  ]          |              | 
1  per  column  of  mesh     ]          |              I        • 

without  ]  compute     D  u***      ]      1     ]       N,       |      1 
name    ]  from  u***  at  phase  4    ]          |              | 

FFT     ]  phases  4.1  and  4.3     ]  N^  igN^   |       Nj      j     3 

1  for  a  row  or  for       |          1     Ni        1     ^        • 
1  a  column  of  mesh       ]  N,xlgNi   ]     -t""^^       '      ^ 

without  ]  phase  4.2  for  each  row  ]     N.    |       N,       1     1 
name     1  of     mesh            ]           ]               ] 

BOUNDP   1    effect  phase  5  for    ]      1     ]       N2      |     1 
]   a  point  on  membrane   ]           ]               1 

INITX    ]    compute  initial       1      1     |       N2       |      1 
i    approximation       |          |              I 
]     minimizing  <I>         |           |               | 

COMPS    ]  compute  coefficients    ]      1     ]       N2       ]    //iter 
1  of  a  linear  system  to   ]          |              1 
1  be  solved  at  each  step  ]          |              I 
]     minimizing   $       |           ]               ] 

REDUCT   ]  effect  one  step  of      ]      1     ]       N2       |  lgN2'<#iter 
]  odd-even  reduction  of   ]           1               | 
]  solution  of  the  linear  ]           |               | 
]  system  minimizing  "Ji     ]           |               1 

Table  3.1.  Characterization  of  various  tasks. 
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4.  Cycle  parallelization. 

4.1.  First  we  briefly  overview  the  "spawn-wait"   feature  proposed 
for   the  ultracomputer  operating   system^  in  [UCN13].   There  are  two 
system  primitives   SPAWN  and  WAIT  whose  invocation  sequences  are 
respectively: 
SPAWN  (TASK,  MULTIPLICITY,  LIST_OF_PARAMETERS)  and  WAIT  (TASK). 

The  invocation  of  SPA^^  causes  an  insertion  of  an  item  in  the 
operating  system  "ready"  queue.  This  item  contains  the  TASK  and  its 
parameters  and  a  specification  that  this  task  is  to  be  executed 
(concurrently)  a  number  of  times  equal  to  the  MULTIPLICITY.  The 
invocation  of  the  WAIT  causes  the  calling  process  to  wait  (or  suspend) 
until  all  the  named  tasks  are  completed  (and  the  corresponding  item  is 
deleted  from  the  queue).  The  standard  way  of  using  these  primitives  is 
the  following.  A  parent  process  first  spawns  a  number  of  children 
processes  (usually  for  the  execution  of  a  large  number  of  relatively 
independent  computations  that,  in  the  serial  program,  are  the 
iterations  of  a  loop).  Then  the  parent  process  calls  WAIT  and  thereby 
suspends  itself  until  all  the  spawned  computations  terminate. 


''Our  program  was  written  following  the  proposals  [UCN13].  Subsequently 
new  suggestions  have  appeared  (see  [UCN23,  UCN40]).  We  believe  the 
general  architecture  of  this  program  may  be  easily  adapted  to  these 
later  suggestions.. 
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4.2.  If  the  spawned  tasks  are  independent  (i.e.   each  task  updates 
only  locations  to  which  it  has  exclusive  access),   then   the  given 
organization  of  parallel  computation  is  safe. 

For  AX  tasks  USTAR  and  COMPS  we  used  the  floating  point  variant  of 
the  basic  indivisible  operation  Fetch-And-Add&. 

As  example  consider  task  USTAR  where  several  PEs  may  concurrently 
update  the  same  velocity  u.^.  To  avoid  conflicts  the  operation  '+'  in 
(2.6)  is  implemented  using  the  floating  point  FAA.  An  alternative 
solution  might  make  use  of  critical  sections  with  one  semaphore  used  to 
protect  addition  in  (2.6)  for  each  mesh  point.  Hashing  might  be 
employed  to  decrease  the  number  of  semaphore  variables.  The  idea  is  to 
use  one  CS  for  many  variables  u^^.  This  latter  solution  might  slow 
down  the  execution,  however. 

4.3.  Our  parallel  program  was  written  for  the  Washcloth  simulator 
[UCN12]  which  simulates  a  parallel  computer  consisting  of  multiple 
CDC-6600s  enhanced  with  Fetch-And-Add. 


The  semantics  of  Fetch-And-Add  are  as  follows:  If  a  PE  invokes 
sequence  FAA  (VARIABLE,  CONSTANT),  where  FAA  is  abbreviation  for 
FETCH-AND-ADD,  VARIABLE  is  shared,  CONSTANT  is  private  to  a  PE,  then: 

1)  the  old  value  of  VARIABLE  is  returned  to  the  invoker; 

2)  the  value  of  VARIABLE  is  incremented  by  CONSTANT. 

The  FAA  operation  is  indivisible  in  the  sense  that  the  result  of 
concurently  executed  FAAs  is  the  same  as  if  the  operations  were 
executed  in  some  unspecified  serial  order  (see  [UCN16]).  No  assumption 
is  made  about  relative  speeds  of  the  PEs. 
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Neither  SPAWN,   nor  WAIT   primitives  were  available  when  our 
FORTRAN-like  parallel  program  was  written.   Therefore  we  simulate  the 
"spawn-wait"  mechanism  by   using   the   following   code   in   FTN 
(CDC/FORTRAN). 

CALL  SYNCH  (MULTIPLICITY) 
PI  CALL  REQUEST(J),  RETURN(P2) 
(4.1)      CALL  TASK(J) 

GO  TO  PI 
P2  CONTINUE 

Here: 

PI  and  P2  are  FORTRAN  labels;   each  instance  of  the  task 

spawning  requires  its  own  labels; 

J  is  a  private  index  whose  value  is  requested  by  and  returned 

to  a  PE  for  executing  a  current  task; 

MULTIPLICITY  is  a  parameter  giving  the  total  number  of  tasks  to 

be  executed;  it  may  be  a  shared  variable  or  a  constant; 

Subroutines  SYNCH  and  REQUEST  (the  latter  with  an  alternate 
RETURN,  allowed  by  FTN)  are  given  in  fig.  3.1.  In  these  codes  we 
replace  the  FORTRAN  operator  "="  by  "<-"  and  the  comparison  operators 
".It."  and  ".gt."  by  "<"  and  ">",  respectively.  The  shared  variable 
DISJCOUNT  contains  the  value  of  #TASK  to  be  distributed  next. 

Subroutine  SYNCH  detains  the  PEs  until  all  of  them  complete  their 
previous  computation.  It  then  initializes  the  shared  variables  LIMIT 
and  DIS_COUNT  that  are  used  in  the  subroutine  REQUEST.  Finally,  the 
PEs  are  released. 
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An  invocation  of  REQUEST  either  returns  a  value  in  the  range  from 
1  to  MULTIPLICITY  used  to  distinguish  the  loop  iterations,  or  directs 
the  invoking  PE  to  label  P2,  if  all  tasks  are  already  taken  for 
execution.  Note  that  the  value  returned  is  unique  for  each  invocation 
of  REQUEST,  so  no  one  execution  of  TASK(J)  will  be  duplicated  and  all 
executions  will  be  initiated. 
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SUBROUTINE    SYNCH    (MULTIPLICITY) 
********************************************************************** 

*  SYNCHRONIZES   ALL   PES   AND   INITIALIZES   PARAMETERS   FOR  NEXT    SPAWNINGS 

***************************************************************************** 

*  LIST    OF    PUBLIC  MEMORY  LOCATIONS: 
PUBLIC   LIMIT,    DISJCOUNT,    #FREE_PES(2) ,    TOTAL 

*  LIST    OF   PRIVATE  MEMORY  LOCATIONS: 
PRIVATE    INDEX 

*  AT   EACH   INVOCATION   OF   THE   SUBROUTINE 

*  THE   OLD  VALUE   OF    INDEX  WILL   BE   KNOWN: 
DATA   INDEX   /I/ 

IF    (FAA   (#FREE_PES    (INDEX),    1)    <  TOTAL-1)   GO  TO    10 

*  THE   LAST    PE    INITIALIZES   DIS_COUNT  AND 

*  LIMIT   FOR  NEXT    SPAWNING,    AND 

*  #FREE_PES   FOR   (NEXT)2    SPAWNING: 
DIS_COUNT    <-       1 

LIMIT    <-  MULTIPLICITY 
#FREE_PES    (INDEX)    <-  0 
GO  TO   20 

*  ALL  PES   EXCEPT   FOR  THE   LAST   ONE 

*  ARE   BUSY-WAITING  HERE:  ^ 
10           IF    (#FREE_PES    (INDEX)    >   0)    GO  TO    10 

*  FLIP-FLOPING   INDEX   (1    <->   2): 
20   INDEX   <-   3-INDEX 

RETURN 


SUBROUTINE  REQUEST  (//TASK),  RETURNS  (LABEL) 
***************************************************************************** 

*  ASSIGNS  CURRENT  //TASK  TO  THE  PE; 

*  IF  NO  MORE  TASKS  ARE  LEFT,  CHOOSES  THE  ALTERNATIVE  RETURN  TO  LABEL 
***************************************************************************** 

*  LIST  OF  PUBLIC  MEMORY  LOCATIONS: 
PUBLIC  LIMIT,  DISJCOUNT,  //FREE_PES(2) ,  TOTAL 

*  LIST  OF  PRIVATE  MEMORY  LOCATIONS: 
PRIVATE  TASK 


//TASK  <-  FAA   (DIS_COUNT,    1) 

IF    (//TASK   >  LIMIT)    RETURN   LABEL 

RETURN 

Figure  3.1 
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5.  Scheduling  analysis  of  the  parallel  execution 
and  the  complexity. 

5.1.  Expression  of  parallel  complexity  via  serial  complexities.  We 
presume  that  the  entire  parallel  computer  consisting  of  P  PEs  is  to  be 
used  to  solve  the  problem,  and  we  evaluate  the  complexity  of  execution 
of  one  time  step  of  the  parallel  algorithm  in  terms  of  the  serial 
complexities  of  the  tasks.  (The  operating  system  will  provide  the 
possibility  of  running  several  jobs  concurrently  improving  utilization 
of  the  machine). 

Since  at  each  spawninig  all  tasks  have  essentially  the  same 
complexity  and  the  multiplicities  are  known  in  advance,  the  analysis  of 
the  scheduling  policy  is  especially  simple. 

For  a  given  spawning,  let  M  be  the  multiplicity  and  let  T  be  the 
time  required  to  execute  a  single  task.  To  the  complexity  M  T  of 
executing  the  tasks  in  the  serial  case  the  waiting  overhead 
W  =  MISMATCH  (M,  N)  x  T  must  be  added.  MISMATCH  (M,N)  is  equal  to  the 
number  of  "PE  x  task"  units  wasted  in  busy-waiting: 
MISMATCH  (M,  P)  =  ]^[  x  P  -  M,  where  ]x[  =  -  [-x]  is  the  ceiling  of  x. 

The  function 


(5.1)  Hp(M)  ~  £[  X  P 
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expresses   the   total   number   of   "PE  x  task"   units   used  for  a  given 
spawning. 

We  also  add  the  term  of  the  form  K  x  p  to  the  total  complexity, 
where  the  constant  K  expresses  the  minimal  number  of  instruction  that  a 
PE  has  to  execute  once  it  is  involved,  even  if  it  does  no  useful  work. 
This  parallelization  overhead  was  first  explicitly  noted  by  David  Korn 
in  [UCN24].  We  will  refer  to  the  terms  corresponding  to  this  overhead 
as  to  Korn^s  overhead. 

For  execution  by  P  PEs  of  L  successive  spawnings  of 
task.  , . . .  jtask^  with  executional  complexities  T]^,...,!^  and 
multiplicities  Mj^,...,M-l,  respectively,  we  get  the  following  expression 
for  complexity 


(5.2) 


C  (P)  =   I      H  (M. )  X  T.  +  P  X  K, 
i=l   P'  1'     1 


where  P  x  K  is  the  total  Korn's  overhead  over  all  L  spawnings. 

The  data  in  table  3.1  allow  (5.2)  to  be  rewritten  in  the  following 
form: 

(5.3)    C  (P)  =  Hp(Ni)  X  T^  +  Hp(N2)  x  T2  +  =p(Ni/2+l)  x  T3  +  P  x  K, 

where  T^^  T2,  and  T3  are  sums  of  complexities  of  those  tasks  that  have 
multiplicities  N^,  N2  and  Ni/2+1,  respectively. 
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5.2.  Experimental  verification  of  the  model  (5.3).  To  check 
validity  of  (5.3)  runs  of  the  parallel  algorithm  were  simulated  for 
various  problem  sizes^  and  numbers  of  processors  in  the  Ultracomputer . 
The  experiments  were  grouped  into  series.  For  the  runs  of  the  main 
series  N2  =  AxN^. 

Three  sizes  {N^,N2  =  4xNi}  of  the  problem  were  picked  with 
Nj^  =  4,  8  and  16.  For  each  size  the  complete  set  of  runs  with 
P  =  1,  2,..., 32  PEs  was  simulated  and  the  set  of  32  experimental 
complexities  C*(P)  was  obtained.  Then  the  system  of  32  equations  (5.3) 
with  left  hands  replaced  by  experimental  C  (P)  was  solved  with  =p(.) 
given  by  (5.1),  and  with  four  unknown  parameters:  K  and  three  unknown 
serial  complexities  T^,  T2,  T3.   The  least  square  method  was  used. 

The  results  of  these  computations  are  presented  in  the  first  three 

rows   of   table  5.1.   To  give  an  idea  of  the  variability  of  C*(P)  the 

minimal  value  C*(l)  and  the  maximal  value  C*^^  of   C*(P),   P  =  1,..,32, 

nia.x 

are  also  shown.  In  table  5.1  values  of  T,  K,  and  C*  are  given  in  units 
of  thousands  of  simulated  CDC-6600  cycles  (for  example,  according  to 
table  5.1   the   maximal   execution  complexity  of  a   problem  with 


^'According  to  C.  Peskin,  at  least  two  membrane  points  must  be  placed  in 
each  mesh  Ax^Ax  square  whose  opposite  sides  are  "pierced"  by  the 
membrane.  Since  N2  and  N^  are  powers  of  2  the  coefficient  r  in  the 
equality  N2  =  rxN^  must  be  a  power  of  2.  Considering  a  circular 
membrane  of  diameter  Ji  as  a  model  we  can  easily  see  that  r  =  4  is  the 
minimal  value  of  r  satisfying  these  limitations.  On  the  other  hand  for 
a  given  Nj^  an  excessively  large  r  (and  hence  N2)  leads  to  excessive 
computations  on  the  membrane  without  increase  of  accuracy.  Thus  r  =  4 
is  optimal  for  the  circle.  In  the  general  case,  the  optimal  choice  of 
r  depends  on  the  membrane  geometry. 
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(N^,  N2)  =  (4,16)  was  3109  thousand  cycles).  The  "error  %"  gives  the 
maximum  over  the  32  runs  of  the  relative  prediction  error 
|C(P)-C*(P) |/C*(P),  the  "error  abs"  is  similarly  the  maximum  of  the 
absolute  prediction  errors  1 (C(P)-C*(P) | .  (Note  that  "error  %"  may  be 
greater   than  _ ^   since   the  maximums  need  not  occur  for  the 


same  case) 


max 


PROBLEM  SIZE 
Ni,N2(#iter) 

Ti  1  T2 

1 

T3 

K 

C  (1) 

error   i 
%   1  absl 

C*(l) 

C* 
max 

4,  16  (6) 

7.6149.6 

1.6 

9.2 

1.1% 

.18%! 

2.61 

838. 

2176. 

8,  32  (5) 

17.3(48.0 

3.4 

8.5 

0.5% 

.19%! 

4.31 

1701. 

3882. 

16,  64  (5) 
4,  8   (5) 

39.3153.1 
7.4138.0 

8.3 
1.3 

9.4 
7.5 

0.2% 
2.2% 

.16%! 
.12%! 

6.9| 
.51 

4111. 
345. 

6700. 

========= 

1734. 

8.  16  (5) 

17.4142.9 

3.5 

8.1 

1.0% 

.23%! 

2.11 

850. 

2296. 

16,  32  (5) 

39.3|47.9 

8.5 

8.5 

0.4% 

.20%! 

4.71 

2246. 

4712. 

8,  8   (5) 

55.0 

3.3 

7.5 

1.6% 

.12%! 

.61 

464. 

2106. 

16,  16  (5) 

82.0 

8.3 

8.2 

0.6% 

.18%! 

2.71 

1395. 

3152. 

Table  5.1 


The  following  conclusions  were  derived  by  examining   the  main  series 
experiments : 

a)  The  model  (5.3)  allows  the  prediction  of  parallel  complexity  C(P) 
very  precisely  once  the  serial  complexities  T's  and  K  are  known. 

b)  Generally  all  T's  and  K  increase  when  the  problem  size   increases 
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(."^2        fo'^    the    (8,32)  problem   is    slightly   less   than   that   for 
(4,16)  problem  due  to  the  smaller  //iter). 

c)  Korn's  coefficient  K  may  increases  with  the  problem  size,  however 
the  ratio  K/C^(l)  steadily  decreases. 

5.3.  Additional  series  of  runs.  As  we  will  see  in  section  6  the 
conclusion  c)  is  especially  important  for  obtaining  the  efficiency  for 
large-scale  parallel  computations.  To  increase  our  confidence  in 
statement  c)  two  additional  series  of  runs  with  N^  =  2  Ni  and  No  =  Ni, 
respectively,  were  performed.  By  virtue  of  a)  these  additional  series 
were  run  for  fewer  number  of  ultracomputer  sizes.  Only  11  values  of  P 
were  picked,  P  =  1,  2,  3,  4,  6,  7,  8,  16,  24,  31,  32.  Note  that  in  the 
series  with  N^  =  Nj^  the  linear  equations  whose  coefficients  are  being 
identified  by  a  least  square  approximation  do  not  allow  the 
determination  of  T^^  and  T2  separately  since  the  corresponding  variables 
have  the  same  coefficient  ^p(Ni)  =  Hp(N2)  in  each  of  the  equations. 
For  this  series  the  estimations  of  T,  +  To  were  computed. 

The  results  of  these  additional  series  are  also  presented  in 
table  5.1  (in  the  second  and  the  third  sections  which  are  separated  by 
double-dashed  lines).  These  series  confirm  staement  c)  above. 

5.4.  Serial  complexities.  We  next  analysed  how  //iter,  Korn's 
coefficient,  and  the  serial  complexities  depend  on  the  problem 
parameters. 
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The  value  of  #iter  is  a  function  of  all  the  parameters  of  the 
problem,  not  only  of  problem  size.  Since  this  dependence  is  not  known, 
we  would  like  a  high  efficiency  for  any  value  of  //iter.  In  fact  this 
is  the  case;  see  section  6.  The  Newton's  minimization  algorithm  was 
set  for  at  least  two  iterations,  so  #iter  may  vary  from  2  to  +  o. 

Trying  to  build  a  complexity  model  for  Korn's  coefficient,  we 
analyse  all  sources  of  complexity  depending  on  P,  the  number  of  PEs . 
The  largest  portion  of  this  overhead,  due  to  a  mismatch  between  P  and 
the  multiplicity  of  task  spawning,  need  not  be  considered  here  since  it 
has  already  been  included  in  the  terms  Hp(Mj^)  x  Tj^  of  (5.2).  Other 
sources  of  the  overhead  are: 

The  REQUEST  routine  in  (4.1)  must  be  executed  by  each  PE  after  all 
currently  spawned  tasks  have  been  assigned.  The  number  of  instances  of 
task  spawnings  (4.11)  is  constant  for  the  part  of  the  program  dealing 
with  the  mesh.  Since  task  spawnings  are  embedded  into  the  Newton 
minimization  loop  which  is  executed  #iter  times,  the  number  of 
instances  of  task  spawnings  (4.11)  is  proportional  to 
(5.4)  A  +  (B  +  C  X  lgN2)  X  //iter 

for  the  part  dealing  with  membrane.   Therefore  the  overall  overhead  per 
PE  due  to  these  factors  can  be  expressed  as  (5.4). 

Line   10  in  SYNCH  code^  has   to  be  executed   by  all  but  one  PE 


''FORTRAN  FTN  compiler  of  the  CDC-6600  considers  line  10  as  a  mistake 
(which  it  surely  would  be  if  the  code  was  serial).  The  "parallel" 
FORTRAN  compiler  should  allow  line  10.  In  the  existing  code  "line  10" 
are  in  fact  two  lines,  using  no-op  CONTINUE  statement. 
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subsequenC   to  #FREE_PES  (INDEX)  becoming  0;  the  overhead  per  PE  can  be 
expressed  in  the  form  (5.4). 

To  make  our  prediction  of  K  safer  we  include  a   term  proportional 

to  N^  X  //iter  to  the  expression  of  K.  (Obtaining  a  small  coefficient 
for  this  term  would  indicate  that  we  did  not  raiss  components  of  K 
growing  much  faster  than  lgN2  x  //iter.)  Thus  we  come  to  the  following 
prediction  formula 

(5.5)  K  =  k.Q  +  (ki   +  k2   X   lgN2  +  k3   x   N2)    x    //iter 

The  four  coefficients  k^  in  (5.5)  were  identified  given  8 
experimental  values  of  K  from  table  5.1  using  a  least  square  method  as 
above.  The  results  are  presented  in  table  5.2,  when  the  error  columns 
have  the  same  meaning  as  in  table  5.1. 


^0  I  ^1  I  ^2  I  k3   I   error 
I     I     I      I  %   I  abs 

2.171.80  1.09  1.002  |2.3%|  .19 


Table  5.2 

The  experimental  k^  is  small  enough.  (Since  one  machine  cycle 
corresponds  to  the  addition  of  .001  to  a  coefficient,  this  k_ 
corresponds  to  only  two  cycles.)  This  confirms  our  analysis  of 
complexity  of  K. 


-27- 
The  analysis  of  the  dependency  of  T^^  T2  and  T3  on  problem 
parameters  could  be  done  by  actually  counting  instructions  in  the 
compiled  code.  This  method  however  was  rejected  because  of  the  large 
length  of  the  program  (the  compiled  serial  code  occupies  more  than  6000 
words,  each  containing  up  to  4  instructions).  We  tried  to  apply  the 
linear  identification  method  to  express  T's  as  functions  of  the  program 
size.  By  analyzing  table  3.1  one  sees  that  the  following  expressions 
might  be  candidates  for  identification  their  coefficients: 

(5.6.1)  Tj^  =  agi  +  a-ii   Nj^  +  a2ixN]^lgN j^ - 

(5.6.2)  T2  =  aQ2  +  ai2'^*it^^  "*■  a22'*^''iterxlgN2 

(5.6.3)  T3  =  ao3  +  aiS^N^  +  a23xNixlgNi 

Formulas  (5.6)  with  their  left-hand  sides  replaced  by  experimental 
values  obtained  from  table  5.1  allow  us  to  generate  a  number  of  linear 
equations  with  unknown  a^^^..  (To  use  experimental  values  of  T^  +  T2  in 
the  series  N^  =  N2  one  has  to  add  (5.6.1)  to  (5.6.2).)  14  equations 
were  solved  to  identify  6  coefficients  a.  .  in  (5.6.1)  and  (5.6.2)  and 
(separately)  8  equations  were  solved  to  identify  3  coefficients  a.  -  in 
(5.6.3).  The  results  of  these  identifications  are  presented  in 
table  5.3. 


i 

1      ^Oi 

1      aii 

1 
1 

a2i 

1 
1 

error 
%        1    abs 

1 

1        .34 

1    1.29 

1 

.29 

1 
—  1  \ 

1 
.94%    1    .16 

1 

2 

1      9.30 

1    2.65 

1 

1.02 

1  ; 
1 

3 

1         .61 

1    -.06( 

)l 

.14 

1 

11.%      1    .17 

Table  5.3 
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This  table  indicates  good  accuracy  for  a.  ,  and  a.  o  and  low 
accuracy  for  a.^^^,  which  means  that  models  (5.6.1)  and  (5.6.2)  are 
satisfactory  but  model  (5.6.3)  is  rather  coarse.  (Coefficient  a,^  even 
turns  out  negative!)  This  does  not  contradict  table  3.1,  which  gives 
only  the  order  of  complexity  for  each  task.   Since  table  5.1  shows  that 

T3   is   small,  in  (5.3)  the  term  involving  T3  is  also  small  and  we  have 
not  refined  this  model. 
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6.  The  estimation  of  efficiency. 

The  aim  of  this  section  is  to  study  how  the  efficiency  of  the 
parallel  algorithm  depends  on  the  number  P  of  PEs  and  on  the  problem 
size.  The  target  is  to  predict  efficiency  for  realistic  size  two  and 
three  dimensional  problems  that  are  too  large  for  us  to  simulate. 

6.1.  First  let  us  recall  the  commonly  accepted  definitions  of 
efficiency.  The  ideal  efficiency  E^^^^-'-  of  an  algorithm  executed  by  P 
PEs  is  given  by 

^best 
(6.1)  Eideal  =  s  

P     P T^TpT 

where  T^®^*^  is  the  time  required  by  the  best  possible  serial  algorithm, 
and  T  (p)  is  the  time  spent  by  the  parallel  algorithm  with  P  PEs. 

Since  the  best  serial  algorithm  is  usually  not  known,  T°^st  ^^ 
(6.1)  is  replaced  by  Tgoo^,  corresponding  to  the  serial  algorithm 
accepted  for  the  problem: 


rpgood 
(6.2)  Egeal  =    s 


P  X  Tp(p) 


Note  that  the  initial  serial  algorithm  was  accepted  as  "good"  in  our 
problem.  After  omitting  the  superscript  "real"  and  "good",  (6.2)  may 
be  rewritten  as 


(6.3)  E„  = 


Ts     T  (1) 


X 


P   Tp(l)    Tp(p-) 


Df   T 
The  ratio  e„  ==  __!_ 
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measures   the  loss  due   to  modifying   the 


serial   code   to  make  it  parallel.   Ratio  ep  == 


Df   T„(l) 


P  '^   Xp(P) 


measures  the 


loss  due  to  execution  of  the  parallel  code  by   P  processor  elements. 
Our  aim  now  is  to  evaluate  e^   and  ep  for  various  values  of  P. 

6.2.  Evaluation  of  e^^.  Table  6.1  contains  empirically  obtained 
values  of  e^  for  various  problem  sizes.  The  values  of  Tg  used  for 
calculating  e^  were  obtained  by  running  the  serial  code  in  a  special 
tracing  mode  provided  by  the  simulator  [UCN12]. 


N^,  N2  (//iter) 

4,  16 

(6)  1 

8,  32 

(5) 

16,  64 

(5) 

^0 

.925 

1 

.934 

.943 

N^,  N2  (#iter) 

4,  8 

(5)  1 

8,  16 

(5) 

16,  32 

(5) 

^0 

.915 

I 

.933 

.948 

N^,  N2  (#iter) 

1 

8,  8 

(5) 

16,  16 

(5) 

!o 

1    .933 

.956 

Table  6.1 


Table  6.1  shows  that  eg  steadily  increases  toward  one  when  both 
parameters  Nj,  and  N2  characterizing  the  problem  size  increase,  keeping 
N2/N]^  fixed.  Since  the  values  are  not  far  from  one,  we  conclude  that 
converting  of  the  serial  code  into  parallel  code  causes  low  overhead. 
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One  can  also  observe  the  following  irregularity  in  table  6.1:  for 
N^  =  4  and  8  the  value  of  eg  increases  when  N2  increases,   whereas   for 
N,  =  16  eg  decreases  when  N2  increases. 

This  effect  could  be  explained  by  expressing  Cq  as  the  weighted 

1       2 
sura  e^  =  Aixeg  +  ^2^^0»   where  Xj  and  X2     ^^^      the  weights   of   the 

execution  complexities   of   the  subcodes  corresponding  to  the  mesh  and 

1      2 
membrane,  respectively,  X.  +  X2  =  1,  eg  and  eg  are  the  efficiencies   of 

the   respective  subcodes.   These  partial   6g's  and  X's  depend  on  the 

problem  size.   Both  partial  eg's  generally  increase  with  the   size   of 

the  problem.   However  the  speed  of  this  increase  may  be  different  for 

different  sizes. 

There  would  be  no  contradiction  with  table  6.1   if,   for   the 

2     1 
interval   of  N2  given  in  that  table,  the  inequality  eg  >  eg  held  for 

smaller  N.  and  the  opposite  inequality  held  for  larger  Nj^.   Then  an 

2 
increase   in  N2  with  Nj^  fixed  would  cause  an  increase  in  Xg,  and  this 

would  also  increase  the  total  e^   for  small  Nj^  and  decrease  the  total  eg 

for  large  N. . 

6.3.  Efficiencies  for  small  2-D  problems.  Table  6.2  contains 
empirical  and  predicted  efficiencies  ep,  P  _>  2,  for  the  three  regular 
(i.e.  N2  =  4xNj^)  smallest  sizes  of  the  2-D  problem.  The  prediction 
was  based  on  the  formula 
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where  C  (P)  and  C  (1)  are  obtained  from  (5.3)   with   coefficients   from 
table   5.1.   The  predicted  values  are  given  in  parenthesis.   The 
agreement  is  seen  to  be  good. 


p 

PROBLEM  SIZE 

Np  N2 

(//iter) 

2 

4,  ] 
.9867 

L6  (6) 
(.9872) 

8,  : 

=  ^^  =  =  =  = 

.9927 

52  (5) 
(.9930) 

16, 
.9955 

64  (5) 
(.9957) 

3 

.8629 

(.8632) 

.9520 

(.9520) 

.9527 

(.9528) 

4 

.9644 

(.9663) 

.9789 

(.9793) 

.9870 

(.9872) 

7 

.7156 

(.7159) 

.8475 

(.8476) 

.8713 

(.8702) 

8 

.8888 

(.8906) 

.9585 

(.9605) 

.9707 

(.9707) 

15 

.4756 

(.4750) 

.6541 

(.6542) 

.7579 

(.7576) 

16 

.7691 

(.7700) 

.8466 

(.8484) 

.9523 

(.9538) 

31 

.3974 

(.3974) 

.4383 

(.4379) 

.6135 

(.6131) 

32 

.3850 

(.3850) 

.6866 

(.6878) 

.7864 

(.7871) 

48 

(.2566) 

(.4584) 

(.5248) 

64 

(.1925) 

(.3438) 

(.5833) 

Table  6.2 

6.4.  Estimating  efficiency  for  large  size  problems  can  not  be  done 
by  using  formulas  (5.3)  and  (6.4)  directly  since  #iter  is  not  known. 
If  one  substitutes  (5.5),  (5.6)  into  (5.3)  then  efficiency  appears  as  a 
rational  function  E(#iter)  of  the  form 


E(#iter)   = 


A  X  //iter  +  B 
C  X  ^;iter  +  D 


where  the  coefficients  A,  B,  C,  D  depend  on  the  problem  size  parameters 
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and  on  P.   Since  #iCer  may  vary  from  2  to  +00  and   A,  B,  C,  D  >  0,   the 
(asymptotic)   minimum   of  E(#iter)  is  E(+oo)  =  A/C,  if  AC  -  BD  <  0,  and 
is  E(2)  =  (2xA+B)/(2xC+D) ,  otherwise.   This  minimum  is  what  we  will 
call  Eg,  the  optimistic  estimate  of  ep. 

Since   predictions  for  the  serial  complexity  T^  is  rather  crude  we 

p 
now  develop  a  more   conservative  estimate  Ep,   which  we   call   the 

pessimistic  estimate.   Note  that  the  parallel  complexity  model  (5.3)  is 

considered  accurate  in  both  policies. 

Replacing  C  (P)  and  C  (1)  in  (6.4)  by  their  expressions  from  (5.2) 
and  carrying  out  the  obvious  algebraic  transformation  one  can  obtain 
the  following  expression  for  e^: 


■L       ,•        1 
(6.5)  e    =   Z  X   X  e^  +  Xq  X   , 


where  the  weights  X^,  i  =  1,,..,P,  are  given  by 

Hp(Mi)  X  Ti 


(6.6)  X. 


P       =p(Mi)  X  Ti  +. ..+  Hp(Mp)  X  Tl  +  P  X  K 
and  the  partial  efficiencies  ei  are  given  by 

(6.7)  e^   =   J_J_. 


Formula  (6.6)  is  valid  for  i  =  0,  if  one  takes  K  as  Tq  and  P  as  Xp(MQ), 
After  discarding  the  term  X„  x  _  in  (6.5)  one  has: 
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L       ,.  L  L 

(6.8)  e    >   z  X.  X  ep  =  (2  >^0  (^  ^^-i  >"   ep) 

^      i=l  ^  i=l  ^  1=1  ^    ^ 


where  new  weights  M^'s  are  given  by  \i±   =  o^   else 

E   X. 


Hp(Mi)    X  Ti 
(6.9)  M, 


1  -p(Mi)    X  Ti   +...+  =p(Mp)    X  Tl 


Note  that 


(6.10)     ix.    =    '  ^"\- "^^    >     ^  ^^^:/^^   =    i-p^^. 

i=i  1  TTF)  -  TTTl  cTTT 

L 
Substituting  the  expression  for   E  X    from  (6.10)  into  (6.8),  one  gets 

i=l  -^ 

the  following  lower  bound  on  ep  for  large  size  problems: 


K 


L 


(6.11)     ep   >    (1  -  P  X  _±_)    X  (Z  M.  X  e^) 


where  U^'s  are  given  by  (6.8)  and  ep's  are  given  by  (6.7). 

In  the  pessimistic  policy  we  estimate  the  ratio  _^___  using   the 

C  (1)      ^ 

minimal  of  its  observed  values  in  the  small  size  experiments  (see 
section  5),  i.e.  by  0.2%  and  in  (6.11)  the  "vi  .-weighted"  sum  is 
replaced  by  the  minimum  of  e^  over  all  i  =  1,...,L.   Hence 

(6.12)    Ep  =  (1  -  P  X  0.002)  X  (min  (e™,  e^,  e^^ ) ) 


where  e^,  ep,  ep  are  partial  efficiencies  corresponding  T^^,  T2  and  T3, 
respectively. 
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Entries   in   table   6.3  above  are  E(5)/E°/Ep.      Symbol      " — "      replaces 
negative   E^. 


p 

PROBLEM  SIZE   N^,  N2 

2 

32,   128 
.997/. 996/. 941 

64,   256 
.998/. 997/. 967 

128,   512 
.999/. 998/. 981 

256,   1024 

=============== 

.999/. 999/. 988 

4 

.991/. 989/. 843 

.994/. 993/. 909 

.996/. 995/. 948 

.998/. 997/. 969 

8 

.98/. 97/. 70 

.99/. 98/. 81 

.99/. 99/. 89 

.99/. 99/. 93 

16 

.96/. 95/. 51 

.97/. 96/. 67 

.98/. 98/. 77 

.99/. 99/. 87 

32 

.94/. 94/. 50 

.94/. 93/. 48 

.96/. 95/. 63 

.98/. 97/. 75 

64 

.71/. 62/. 23 

.93/. 92/. 45 

.92/. 91/. 44 

.95/. 95/. 59 

128 

.47/. 38/. 10 

.62/. 55/. 19 

.91/. 91/. 38 

.91/. 90/. 37 

256 

.23/. 19/. 03 

.37/. 31/. 06 

.55/. 50/. 12 

.90/. 90/. 25 

512 

.12/.09/~ 

.19/.15/~ 

|.30/.27/~ 

.50/. 47/  — 

1024 

.06/.05/~ 

.09/.08/~ 

|.15/.13/~ 

I. 26/. 24/-- 

Table  6.3 


Data   in  table  6.3  show  that  parallelization  is  efficient  in  the 
2-dimensional  case  for  up  to  a  few  hundred  PEs . 


6.5.  Efficiency  prediction  for  3-D  problems.  In  the  3-D  case,  the 
membrane  is  2-dimensional  and  many  different  structures  are  possible. 
We  suppose,  however,  that  the  membrane  is  reinforced  by  a  collection  of 
one-dimensional  fibers  and  that  the  minimization  problem  (2.3)-(2.5)  is 
to  be  solved  independently  on  each  fiber  in  this  collection.  Such  a 
collection  can  be   treated  as  a  single  large  fiber.   (This  may  not  be 
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the  most  efficient  method,  but  it  is  the  one   we   can  analyse   without 
further  programming.) 

Let  the  total  number  of  membrane  points  be  N2  =  8  X  N]^~.  This 
relation  may  be  considered  as  optimal  in  the  considered  case  (cf. 
footnote  on  p.  22).  We  will  assign  the  same  amount  of  work  for  each 
task,  but  replace  multiplicities  N.  by  Ni  for  tasks  dealing  with  mesh. 
(The  multiplicity  ___  +  1  of  a  non-standard  FFT  task  must  be  replaced  by 
N,  X  (---  +  1).)  Orders  of  complexities  of  tasks  remain  the  same  as  in 
table  3.1.  Whereas  the  number  of  invocations  required  and  thus  the 
values  of  coefficients  a^.  in  (5.7)  may  change,  we  will  not  try  to 
predict  a^.  and  k^^^  in  the  3-D  case.  Instead  we  use  the  2-D  values. 
Table  6.4  presents  the  E°  predictions  obtained. 

Note  that  Ep  increases  when  a^.  increase.  (This  may  be  proven 
analytically.  Since  the  formula  for  computing  E°  was  implemented  as  a 
computer  program,  we  prefered  to  check  it  experimentally.  For  the 
problem  size  N^  =  256,  N2  =  524288  the  values  of  the  a^js  were 
independently  and  uniformly  varied  in  the  range  from  the  nominal  value 
to  the  150%  of  it.  All  1000  computed  values  of  trial  E°„„,  appears  to 
be  greater  than  the  nominal  value  of  this  quantity.) 
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p 

PROBLEM 

SIZE   N^,  N2 

2 

32,  8192 
.9999 

1   64 

32768 
.9999 

128,  131072 
.9999 

256,  524288 
.9999 

4 

.9996 

.9997 

.9997 

.9997 

8 

.9991 

.9993 

.9993 

.9994 

16 

.998 

.998 

.999 

.999 

32 

.996 

.997 

.997 

.997 

64 

.992 

.993 

.994 

.995 

128 

.984 

.987 

.988 

.989 

256 

.968 

.974 

.977 

.979 

512 

.937 

.949 

.955 

.959 

1024 

.882 

.902 

.913 

.921 

2048 

.692 

.821 

.840 

.854 

4096 

.471 

.697 

.724 

.744 

8192 

.288 

.527 

.568 

.593 

Table  6.4 


Data  in  table  6.3  shows  that  parallelization  in  the  3-dimensional 
case  is  efficient  for  up  to  a  few  thousand  of  PEs .  In  particular,  a 
problem  of  size  (64,32768)  (a  size  C.  Peskin  believes  to  be  of 
interest)  can  efficiently  utilize  an  entire  Ultracomputer  consisting  of 
4096  PEs. 
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7,  The  pattern  of  the  Ultracomputer  activity 
with  network  delay  simulation. 

Several  additional  runs  of  the  program  were  simulated  more 
accurately  by  including  delays  in  the  network  connecting  PEs  with 
shared  memory  (SM).  This  was  done  using  the  NETSIM  simulator  [UCN28] 
for  a  6-stage  network  of  4  4  switches  (see  [UCN40]).  This  corresponds 
to  the  Ultracomputer  with  4096  PEs.  In  our  simulations  a  given 
fraction  of  the  Ultracomputer  (4,  8  and  16  PEs,  respectively)  was 
executing  our  program  while  the  rest  of  the  Ultracomuter  was  idle,  i.e. 
did  not  produce  additional  network  traffic. 

Note  that  a  request  to  the  shared  memory  (SM)  on  an  unloaded 
nerwork  proceeds  for  8  cycles,  so  that  the  delay  is  at  least  7  more 
cycles  than  in  the  simulations  described  above. 

The  overall  additional  delay  of  the  simulation  and  various  memory 
referencing  statistics  for  runs  on  a  problem  of  sizes  N,  =8,  No  =  32 
(//iter  =  5)  are  represented  in  following  table  7.1. 

The   effective  average  delay  per  SM  load  (row  7  in  table  7.1)  was 

computed  as 

#PEs  X  (tj  -  t„) 
//SM_loads 

where  t^,  t^j  are  times  of  execution  without  (row  1)   and   with   (row  2) 
network  delay,   respectively;   //SM  loads  is  computed  as  #SM  references 
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(row  4)  multiplied  by  %SM_loads  (row  5).   This  delay  is  less   than  the 
network  latency  due  to  prefetching. 


//ROW 

STATISTICS 

time  of  execution 
without  delay 

NUMBER  OF  PES  EXECUTING 

THE  PROGRAM 

1 

4 
434526 

8 
221885 

16 
125603 

2 

time  of  execution 
with  delay 

581570 
(+34%) 

296630 
(+34%) 

172797 
(+38%) 

3 

#  of  data  memory 
references 

596480 

589448 

671314 

4 

^^   of  SM  references 

134388 
(23%  of  all) 

142639 
(24%  of  all) 

180101 
(27%  of  all) 

5 

%  of  all  of  SM 
loads /s tores /FAAs 

82%/15%/3% 

82%/15%/3% 

84%/13%/3% 

6 

average  delay  per 
loads/stores/FAAs 

9.54/9.55/10.4 

8.97/8.96/9.41 

8.76/8.77/9.03 

7 

effective  average 

6.36 

6.14 

6.17 

8 

%  of  all  of  local 
loads/stores 

70%/30% 

70%/30% 

7 2%/ 2 8% 

Table  7.1 


The  delay  through  the  network  is  another  factor  decreasing 
performance  of  the  parallel  algorithm.  According  to  table  7.1  this 
delay  is  not  excessive.  It  seemingly  does  not  increase  when  degree  of 
parallelization  P  increases  if  P  remains  less  than  N. . 
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It  is  likely  that  idle  waiting  creates  more  network  traffic  than 
normal  processing.   This  is  confirmed  by: 

1)  For  P  >  N^  idle  waiting  intensifies  the  stream  of  idle  SM 
references  inside  SYNCH  loop  10  (see  fig.  3.1)  whose  partial  delay  must 
be  greater  than  the  average; 

2)  The  number  of  SM  references  noticably  increases  for  case  of  16 
PEs;  while  the  fraction  of  SM  loads  also  increses.  Note  that  one 
execution  of  loop  10  at  SYNCH  routine  corresponds  to  one  SM  load; 

3)  The  delay  increses  when  number  of  PEs  increases  from  8  (delay 
34%)  to  16  (delay  38%). 

Note  that  if  the  operating  system  uses  idle  PEs  for  other  jobs  the 
overall  delay  might  decrease. 


8.  Conclusion. 

The  algorithms  of  the  considered  class  can  be  easily  adapted  for 
parallel  processing  on  the  Ultracomputer ,  using  task  spawning  mechanism 
for  loop  parallelization.  Various  simulation  results  presented  show 
high  efficiency  of  the  execution  with  hundreds  of  PEs  in  the  2-D  case 
and  with  thousands  of  PEs  in  the  3-D  case.  The  degree  of 
parallelization  of  successive  task  spawnings  varies  and  the  problem 
execution  generates  numerous  busy  waits.  Therefore  the  throughput  of 
the  machine  will  be  improved  without  lowering  the  speed-up  for  the  main 
job,  if  several  small  low  priority  jobs  are  running  concurrently  with 
it. 
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